Abstract

We present an approach to semi-supervised learn- 
ing based on an exponential family characteriza- 
tion. Our approach generalizes previous work on 
coupled priors for hybrid generative/discriminative 
models. Our model is more flexible and natural 
than previous approaches. Experimental results on 
several data sets show that our approach also per- 
forms better in practice. 

1 Introduction 

Labeled data on which to train machine learning algorithms 
is often scarce or expensive. This has led to significant in- 
terest in semi-supervised learning methods that can take ad- 
vantage of unlabeled data [Cozman et al, 2003; Zhu, 2005]. 
While it is straightforward to integrate unlabeled data in a 
generative learning framework [Nigam et al., 2000], it is 
not so in a discriminative framework. Unfortunately, it is 
well-known both empirically and theoretically [Ng and Jor- 
dan, 2002] that discriminative approaches tend to outperform 
generative approaches when there is enough labeled data. 
This has led to many recent developments in hybrid genera- 
tive/discriminative models that are able to leverage the power 
of both frameworks (see Section 4). One particular such ex- 
ample is the work of Lasserre et al. [2006], who describe a 
hybrid framework ("PCP") in which a generative model and 
discriminative model are jointly estimated, using a prior that 
encourages them to have similar parameters. 

In this paper, we generalize the parameter coupling prior¹
(PCP) method [Lasserre et al., 2006] to arbitrary distributions
belonging to the exponential family. Unlike the PCP method, 
we do not restrict ourselves to the Gaussian prior, but in- 
stead choose a prior that is natural to the model. Other au- 
thors [Bouchard, 2007] have also noted the inappropriateness 
of the Gaussian prior to couple the generative and discrim- 
inative models. Our resulting approach for hybridizing dis- 
criminative and generative models is: (1) not restricted to a 
particular class of the models; (2) more flexible in choosing 

¹ The terms parameter coupling prior and coupled prior were introduced
in [Druck et al., 2007] and do not appear in [Lasserre et al., 2006],
though they refer to the framework introduced in [Lasserre et al.,
2006].



the way these two models can be combined; and (3) able to yield a
closed-form solution for the generative parameters, unlike the PCP
method, where one has to resort to numerical optimization. We
demonstrate our framework using a Beta/Binomial conjugate pair on the
text categorization problems addressed by Druck et al. [2007].

2 Background 

In general, machine learning approaches to classification can 
be divided into two categories: generative approaches and 
discriminative approaches. Generative approaches assume 
that the data is generated through an underlying process. One
simple example is document categorization: for each example
$(x, y)$, we first choose a category $y$, and then produce a
document x conditioned on the category y. The goal in gener- 
ative modeling is to approximate the joint distribution p(x, y) 
that represents this process. On the other hand, discrimina- 
tive approaches do not assume any underlying process and 
directly model the probability of category given the document 
p(y|x). Ng and Jordan [2002] compare these two approaches 
and show that while discriminative models are asymptotically 
better than generative models, generative models need less 
data to train. 

In semi-supervised settings, one has access to lots of un- 
labeled data but only a small amount of labeled data. It is 
easy to see that unlabeled data is not directly useful in a 
discriminative setting but can be easily used in generative 
setting. However, since discriminative methods asymptoti- 
cally tend to outperform generative methods [Ng and Jordan, 
2002], this naturally leads to combining these two approaches 
and building a hybrid model that does better than the indi- 
vidual models. Earlier work [Bouchard and Triggs, 2004; 
Lasserre et al, 2006; Druck et al., 2007] has shown the ef- 
ficacy of the hybrid approach. 

2.1 Exponential Family and Conjugate Priors 

For the sake of completeness, we briefly define the exponen- 
tial family which we will use as the basis of our hybrid model. 
The exponential family is a set of distributions whose probability
density function² can be expressed in the following form:

$$f(x; \theta) = h(x) \exp\big(\langle \eta(\theta), T(x) \rangle - A(\theta)\big) \qquad (1)$$

Here $T(x)$ is the vector of sufficient statistics, $\eta(\theta)$ is a
function of the natural parameters $\theta$, and $A(\theta)$ is a
normalization constant (also known as the log-partition function).

² "Density" can be replaced by "mass" in the case of a discrete random
variable.
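As a concrete illustration of Eq (1), the following minimal sketch writes the Bernoulli distribution in exponential family form and checks it against the usual mass function. The choice of the Bernoulli as the example, and the function names, are ours; the identification $h(x) = 1$, $T(x) = x$, $\eta(\theta) = \theta$ and $A(\theta) = \log(1 + e^{\theta})$ is the standard one.

```python
import math


def bernoulli_pmf(x, v):
    """Standard Bernoulli mass function with mean parameter v."""
    return v ** x * (1.0 - v) ** (1 - x)


def bernoulli_expfam(x, theta):
    """Same distribution in the form h(x) * exp(theta * T(x) - A(theta)).

    Here h(x) = 1, T(x) = x and A(theta) = log(1 + exp(theta)).
    """
    log_partition = math.log1p(math.exp(theta))
    return math.exp(theta * x - log_partition)


# The natural parameter is the log-odds of the mean parameter.
v = 0.3
theta = math.log(v / (1.0 - v))
for x in (0, 1):
    assert abs(bernoulli_pmf(x, v) - bernoulli_expfam(x, theta)) < 1e-12
```

The same pattern (fix $T$, $\eta$ and $A$, then exponentiate) applies to any other member of the family.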

One important property of the exponential family is the ex- 
istence of conjugate priors. Given any member of the expo- 
nential family in Eq (1), the conjugate prior is a distribution 
over its parameters with the following form: 

$$p(\theta \mid \alpha, \beta) = m(\alpha, \beta) \exp\big(\langle \eta(\theta), \alpha \rangle - \beta A(\theta)\big)$$

Here $\langle a, b \rangle$ denotes the dot product of vectors $a$ and
$b$. Both $\alpha$ and $\beta$ are hyperparameters of the conjugate
prior. Importantly, the function $A(\cdot)$ is the same for the
exponential family member and the conjugate prior.

A second important property of the exponential family is the
relationship between the log-partition function $A(\theta)$ and the
sufficient statistics. In particular, we have:

$$\frac{\partial A}{\partial \theta} = \mathbb{E}_{\theta}[T(x)] \qquad (2)$$
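The identity in Eq (2) can be checked numerically. The sketch below does so for the Bernoulli case, where $A(\theta) = \log(1 + e^{\theta})$ and the expected sufficient statistic is the success probability; the finite-difference helper and its step size are illustrative choices of ours.

```python
import math


def log_partition(theta):
    """A(theta) for the Bernoulli distribution in natural form."""
    return math.log1p(math.exp(theta))


def expected_statistic(theta):
    """E[T(x)] = P(x = 1) for the Bernoulli, i.e. the logistic function."""
    return 1.0 / (1.0 + math.exp(-theta))


def numerical_derivative(f, t, eps=1e-6):
    """Central finite-difference approximation of f'(t)."""
    return (f(t + eps) - f(t - eps)) / (2.0 * eps)


for theta in (-2.0, -0.5, 0.0, 1.0, 3.0):
    lhs = numerical_derivative(log_partition, theta)
    rhs = expected_statistic(theta)
    assert abs(lhs - rhs) < 1e-6
```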



2.2 Hybrid Model with Coupled Prior

We first define the problem and some of the notation that we will use
throughout the paper. Our task is to learn a model that predicts a
label $y$ given an example $x$. We are given the data
$D = D_L \cup D_U$, where $D_L$ represents the labeled data and $D_U$
represents the unlabeled data. Each instance of the labeled data
consists of a pair $(x, y)$, where $x$ is a feature vector and $y$ is
the corresponding label. Each instance of the unlabeled data consists
of only a feature vector $x$. The $x$s are $M$-dimensional feature
vectors, and $x_d$ denotes the $d$-th feature.

We now give a brief overview of the hybrid model presented
by Lasserre et al. [2006]. The hybrid model is a mixture
of discriminative and generative components, both of
which have separate sets of parameters. These two sets of 
parameters (hence two models) are combined using a prior 
called coupled prior. Considering only one data point (the 
extension to multiple data points is straightforward and pre- 
sented later), the model is defined as follows: 

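$$p(x, y, \theta, \tilde{\theta}) = p(\theta, \tilde{\theta})\, p(y \mid x, \theta)\, p(x \mid \tilde{\theta}) = p(\theta, \tilde{\theta})\, p(y \mid x, \theta) \sum_{y'} p(x, y' \mid \tilde{\theta})$$

Here $\theta$ is the set of discriminative parameters,
$\tilde{\theta}$ the set of generative parameters, and
$p(\theta, \tilde{\theta})$ provides the natural coupling between these
two sets of parameters. $p(y \mid x, \theta)$ is the discriminative
component; $p(x \mid \tilde{\theta}) = \sum_{y'} p(x, y' \mid \tilde{\theta})$
is the generative component.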

The most important aspect of this model is the coupled prior
$p(\theta, \tilde{\theta})$, which interpolates the hybrid model between
two extremes: the generative model when $\theta = \tilde{\theta}$, and
the discriminative model when $\theta$ is independent of
$\tilde{\theta}$. In other cases, the goal of the coupled prior is to
encourage the generative model and the discriminative model to have
similar parameters. In earlier work [Lasserre et al., 2006; Druck et
al., 2007], a Gaussian prior was used as the coupled prior:

$$p(\theta, \tilde{\theta}) \propto \exp\left(-\frac{1}{2\sigma^2} \|\theta - \tilde{\theta}\|^2\right)$$

Unfortunately, the Gaussian prior is not always appropriate
[Bouchard, 2007].
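To make the role of the coupling strength concrete, the sketch below evaluates the unnormalized log of a Gaussian coupled prior of the form $\exp(-\|\theta - \tilde{\theta}\|^2 / (2\sigma^2))$. The function name and the list-based parameter representation are assumptions made for illustration only.

```python
def gaussian_coupling_log_prior(theta, theta_tilde, sigma):
    """Unnormalized log of the Gaussian coupled prior.

    theta and theta_tilde are equal-length sequences of floats holding the
    discriminative and generative parameters; sigma is the prior scale.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(theta, theta_tilde))
    return -squared_distance / (2.0 * sigma ** 2)


# A small sigma penalizes disagreement heavily (near-generative regime);
# a large sigma barely penalizes it (near-discriminative regime).
tight = gaussian_coupling_log_prior([0.0, 0.0], [1.0, 1.0], sigma=0.5)
loose = gaussian_coupling_log_prior([0.0, 0.0], [1.0, 1.0], sigma=10.0)
assert tight < loose
```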



3 Exponential Family Hybrid Model

In this section, we provide a more general prior for the hybrid 
model that is not only mathematically convenient but also al- 
lows choosing a problem specific prior. 

3.1 Exponential Family Generalization

First, we generalize the hybrid model defined in Section 2.2 
for the distributions that come from the exponential family. 
In other words, all of the distributions (generative, discrim- 
inative and coupled prior) of the generalized hybrid model 
belong to the exponential family. We first provide the def- 
initions of discriminative and generative models in terms of 
exponential family. 

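Generative model:

$$p(x, y \mid \tilde{\theta}) = h(x, y) \exp\big(\langle \tilde{\theta}, T(x, y) \rangle - A(\tilde{\theta})\big) \qquad (3)$$

Discriminative model:

$$p(y \mid x, \theta) = g(y) \exp\big(\langle \theta, T(x, y) \rangle - B(\theta, x)\big) \qquad (4)$$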

Next, we break the coupled prior $p(\theta, \tilde{\theta})$ into two
parts: an independent prior on the discriminative parameters,
$p(\theta)$, and a prior on the generative parameters given the
discriminative parameters, $p(\tilde{\theta} \mid \theta)$. This
formulation lets us model the dependency of the generative component on
the discriminative component. Our new hybrid model is now defined as:

$$p(x, y, \theta, \tilde{\theta}) = \big[p(\theta)\, p(y \mid x, \theta)\big]\; p(\tilde{\theta} \mid \theta)\; \Big[\sum_{y'} p(x, y' \mid \tilde{\theta})\Big] \qquad (5)$$

For convenience and interpretability (later we will show that it also
improves performance), we choose the coupled prior
$p(\tilde{\theta} \mid \theta)$ to be conjugate to the generative model.

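Conjugate prior:

$$p(\tilde{\theta} \mid \theta) = m(\theta) \exp\big(\langle \tilde{\theta}, \alpha(\theta) \rangle - \beta(\theta) A(\tilde{\theta})\big) \qquad (6)$$

Here, $\alpha(\cdot)$ and $\beta(\cdot)$ are user-defined functions that
map the discriminative parameters $\theta$ into hyperparameters for the
conjugate prior. We discuss suitable choices of these functions in
Section 3.4.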

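Substituting the exponential family definitions of the generative
model Eq (3), the discriminative model Eq (4), and the coupled prior
Eq (6) into Eq (5), and taking the log, we obtain the log joint
probability of data and parameters:

$$\begin{aligned}
L = \log p(x, y, \theta, \tilde{\theta}) = {} & \log p(\theta) + \log m(\theta) + \langle \tilde{\theta}, \alpha(\theta) \rangle - \beta(\theta) A(\tilde{\theta}) \\
& + \sum_{(x, y) \in D_L} \big[\log g(y) + \langle \theta, T(x, y) \rangle - B(\theta, x)\big] \\
& + \sum_{x \in D} \log \sum_{y'} \big[h(x, y') \exp\big(\langle \tilde{\theta}, T(x, y') \rangle - A(\tilde{\theta})\big)\big]
\end{aligned} \qquad (7)$$

Note that the discriminative part is defined only for the labeled
data, while the generative part is defined for both the labeled and
the unlabeled data.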

3.2 Parameter Optimization 

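We perform parameter optimization by a coordinate descent method,
alternating between optimizing the discriminative parameters $\theta$
and optimizing the generative parameters $\tilde{\theta}$.

For the generative parameters, we take the partial derivative of the
log probability in Eq (7) with respect to $\tilde{\theta}$:

$$\frac{\partial L}{\partial \tilde{\theta}} = \alpha(\theta) - \beta(\theta) A'(\tilde{\theta}) + \sum_{x \in D} \sum_{y'} p(y' \mid x, \tilde{\theta}) \big(T(x, y') - A'(\tilde{\theta})\big)$$

Here, $p(y' \mid x, \tilde{\theta})$ is the probability based on the
parameters estimated in the last iteration,
$p(y' \mid x, \tilde{\theta}_{old})$. Substituting this in the above
equation and setting it equal to zero, we obtain:

$$A'(\tilde{\theta}) = \frac{\sum_{x \in D} \sum_{y'} p(y' \mid x, \tilde{\theta}_{old})\, T(x, y') + \alpha(\theta)}{N + \beta(\theta)} \qquad (8)$$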

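Here $A'(\tilde{\theta})$ denotes the partial derivative of
$A(\tilde{\theta})$ with respect to $\tilde{\theta}$. As discussed in
Section 1, choosing a conjugate prior gives us a closed-form solution
for $A'(\tilde{\theta})$. From Eq (2), we know that
$A'(\tilde{\theta})$ is equivalent to the expected sufficient
statistics of the generative model.

Having solved for the generative parameters $\tilde{\theta}$, we now
solve the hybrid model for the discriminative parameters $\theta$:

$$\frac{\partial L}{\partial \theta} = \frac{\partial \log p(\theta)}{\partial \theta} + \frac{\partial \log m(\theta)}{\partial \theta} + \langle \tilde{\theta}, \alpha'(\theta) \rangle - \beta'(\theta) A(\tilde{\theta}) + \sum_{(x, y) \in D_L} \big(T(y, x) - B'(\theta, x)\big) \qquad (9)$$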

There is no closed-form solution to the above expression; therefore,
we solve it using numerical methods. In our implementation, we use
stochastic gradient descent.
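The alternating scheme (a closed-form update for the generative parameters followed by gradient steps for the discriminative parameters) can be sketched on a toy problem. The objective below is not the hybrid objective of Eq (7); it is a hypothetical scalar stand-in, $-(\theta - a)^2 - (\tilde{\theta} - b)^2 - c(\theta - \tilde{\theta})^2$, chosen only because one block has a closed-form maximizer and the other can be handled by gradient ascent. Plain (full) gradient steps are used in place of stochastic ones, and all names and constants are assumptions.

```python
def alternate(a=0.0, b=1.0, c=1.0, learning_rate=0.1, iterations=200):
    """Alternating maximization of a toy coupled objective.

    theta plays the role of the discriminative parameter (gradient step),
    theta_tilde the role of the generative parameter (closed-form update),
    and c the strength of the coupling between them.
    """
    theta, theta_tilde = 0.0, 0.0
    for _ in range(iterations):
        # Closed-form maximizer over theta_tilde with theta held fixed.
        theta_tilde = (b + c * theta) / (1.0 + c)
        # One gradient-ascent step on theta with theta_tilde held fixed.
        gradient = -2.0 * (theta - a) - 2.0 * c * (theta - theta_tilde)
        theta += learning_rate * gradient
    return theta, theta_tilde


theta, theta_tilde = alternate()
# For a=0, b=1, c=1 the joint maximizer is theta=1/3, theta_tilde=2/3.
assert abs(theta - 1.0 / 3.0) < 1e-9
assert abs(theta_tilde - 2.0 / 3.0) < 1e-9
```

With a larger coupling constant `c` the two parameters are pulled closer together, which mirrors the interpolation behaviour of the coupled prior.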

3.3 Hybrid Multiple Binomial Model 

In this section, we see how this hybrid model can be applied 
in practice. We first choose a generative model that is suitable 
to our application. We next choose the coupled prior conju- 
gate to the generative model. Since later on, we intend to use 
the hybrid model for the document classification task, we use 
a naive Bayes³ (NB) model for the generative part and logistic
regression for the discriminative part, akin to the study of Ng 
and Jordan [2002]. The generative part of our model (naive 
Bayes) is given by: 

$$p(y, x \mid \pi, v) = p(y \mid \pi)\, p(x \mid y, v) = \prod_{k} \pi_k^{\mathbb{1}\{y = k\}} \prod_{d} v_{yd}^{x_d} (1 - v_{yd})^{1 - x_d} \qquad (10)$$

Here, $\sum_y \pi_y = 1$ and $0 < v_{yd} < 1$. $\mathbb{1}\{y = k\}$ is
an indicator function that takes value 1 if $y = k$ and 0 otherwise.

³ It should be noted that "naive Bayes" classifiers come in (at least)
two different versions: the "multivariate Bernoulli version" and the
"multinomial version" [McCallum and Nigam, 1998]. Because of its
generality, in our implementation, we use multivariate Bernoulli.
The discriminative part is:

$$p(y \mid x, w, b) = \frac{1}{Z_x} \exp\Big(b_y + \sum_{d} x_d w_{yd}\Big) \qquad (11)$$



where $Z_x = \sum_{y'} \exp\big(b_{y'} + \sum_d x_d w_{y'd}\big)$ is a
normalization constant. Note that since these models form a
generative/discriminative pair, the number of parameters is the same in
both models. It is easy to see that there is a one-to-one relationship
between these two sets of parameters: $b_y$ in the discriminative model
behaves similarly to $\pi_y$ in the generative model, and $w_{yd}$
behaves similarly to $v_{yd}$. Since $w_{yd}$ and $v_{yd}$ are the
parameters that capture most of the information, we use the coupled
prior to couple these sets of parameters and do not couple $b_y$ and
$\pi_y$. It is important to note the difference between the canonical
parameters of the exponential family representation of the model and
the mean parameters. In the generative (or discriminative) model,
$\tilde{\theta}_{yd}$ (or $\theta_{yd}$) denote the canonical
parameters while $v_{yd}$ (or $w_{yd}$) denote the mean parameters.
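The discriminative component, a multiclass logistic regression with class biases $b_y$ and weights $w_{yd}$ normalized by $Z_x$, can be sketched as follows. The function name and the nested-list layout of `w` are assumptions; subtracting the maximum score before exponentiating is a standard numerical-stability step and does not change the result.

```python
import math


def class_posterior(x, w, b):
    """Return [p(y | x, w, b) for each class y].

    x: feature vector (list of floats), w[y][d]: weight of feature d for
    class y, b[y]: bias of class y.
    """
    scores = [
        b[y] + sum(x_d * w_yd for x_d, w_yd in zip(x, w[y]))
        for y in range(len(b))
    ]
    # Shift by the maximum score so that exp() cannot overflow.
    top = max(scores)
    unnormalized = [math.exp(s - top) for s in scores]
    z_x = sum(unnormalized)
    return [u / z_x for u in unnormalized]


probs = class_posterior([1.0, 0.0], w=[[2.0, 0.0], [0.0, 2.0]], b=[0.0, 0.0])
assert abs(sum(probs) - 1.0) < 1e-12
assert probs[0] > probs[1]
```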

Having defined the appropriate discriminative and gener- 
ative models, now we can get equivalent exponential family 
forms of these models. First we show the exponential form 
of the generative model. The generative model in Eq (10) 
can be broken into two parts: one is the class probability $p(y \mid \pi)$
and the other is the class-conditional probability $p(x \mid y, v)$. Since the
parameters of these distributions are independent, we can get 
their exponential representations separately. Considering the 
class conditional probability for one feature, Eq (10) can be 
written in the following form: 



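$$p(x_d \mid y, v_{yd}) = \exp\left(x_d \log \frac{v_{yd}}{1 - v_{yd}} + \log(1 - v_{yd})\right)$$

Comparing this with Eq (3) gives
$\tilde{\theta}_{yd} = \log \frac{v_{yd}}{1 - v_{yd}}$,
$A(\tilde{\theta}_{yd}) = \log(1 + e^{\tilde{\theta}_{yd}})$ and
$T(y, x) = x_d$. Substituting these along with the appropriate
conjugate prior in Eq (8) gives us a closed-form solution for
$A'(\tilde{\theta}_{yd})$, which, in the naive Bayes model, is equal to
$v_{yd}$.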



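$$A'(\tilde{\theta}_{yd}) = v_{yd} = \frac{\sum_{x \in D} p(y \mid x, \tilde{\theta}_{old})\, x_d + \alpha(\theta)}{N + \beta(\theta)} \qquad (12)$$

In other words, $v_{yd}$ is the normalized expected count of the
$d$-th feature in class $y$, with smoothing parameters that are
controlled by the coupled prior hyperparameters $\alpha(\theta)$ and
$\beta(\theta)$.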
Next we solve for $\pi_y$ by directly optimizing the objective
function Eq (10) with respect to $\pi_y$ under the given constraints.
This gives us
$\pi_y = \frac{1}{N} \sum_{x \in D} p(y \mid x, \tilde{\theta}_{old})$,
which is the normalized expected number of examples in class $y$.
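The closed-form generative updates can be sketched as below: smoothed expected feature counts for $v_{yd}$ and normalized expected class counts for $\pi_y$. This is a minimal sketch under stated assumptions: the responsibilities `resp[n][y]` stand for $p(y \mid x_n, \tilde{\theta}_{old})$ and are taken as given; the normalizer for $v_{yd}$ is taken to be the expected number of examples in class $y$, which is our reading of $N$ for a class-specific parameter; and all names are hypothetical.

```python
def generative_update(features, resp, alpha, beta):
    """One closed-form update of the naive Bayes parameters.

    features: list of binary feature vectors x_n.
    resp:     resp[n][y], posterior of class y for example n under the
              previous parameters (for labeled data, a one-hot vector).
    alpha:    alpha[y][d], smoothing numerators from the coupled prior.
    beta:     scalar smoothing denominator from the coupled prior.
    Returns (pi, v) with pi[y] and v[y][d].
    """
    n_examples = len(features)
    n_classes = len(resp[0])
    n_features = len(features[0])
    pi, v = [], []
    for y in range(n_classes):
        # Expected number of examples assigned to class y (assumed normalizer).
        expected_class_count = sum(r[y] for r in resp)
        pi.append(expected_class_count / n_examples)
        row = []
        for d in range(n_features):
            expected_feature_count = sum(
                r[y] * x[d] for r, x in zip(resp, features)
            )
            row.append(
                (expected_feature_count + alpha[y][d])
                / (expected_class_count + beta)
            )
        v.append(row)
    return pi, v


features = [[1, 0], [1, 1], [0, 1]]
resp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
alpha = [[0.5, 0.5], [0.5, 0.5]]
pi, v = generative_update(features, resp, alpha, beta=1.0)
assert abs(pi[0] - 2.0 / 3.0) < 1e-12
assert abs(v[0][1] - 0.5) < 1e-12
```

With `alpha[y][d] = beta / 2` the update shrinks each estimate towards 0.5, which is how the coupled prior acts as a smoother on the counts.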

Having solved for the generative parameters, we now solve for the
discriminative parameters. Ideally, we would like to first obtain an
equivalent exponential form of Eq (11) and then solve it using Eq (9).
Since Eq (9) is only defined for the discriminative parameters that are
coupled ($w$), we cannot use Eq (9) unless we break Eq (11) into two
exponential forms, one for $w$ and one for $b$, and it is not clear how
to do so. Therefore, we solve for the discriminative parameters
directly, without converting Eq (11) into exponential form. It is
important to note here that the mean parameters in Eq (11) are equal to
the canonical parameters $\theta_{yd}$. We place a Gaussian prior
$p(\theta) = \mathcal{N}(\theta \mid 0, \sigma^2)$ on $w = \theta$ and
an improper uniform prior on $b$. Taking derivatives, we obtain:



" 2 + a + {Oyda{wyi)) - l3'iwya)A{eya) 

yd avjyd 



dm. 



dL 
dby 



Y {l{a=y'} - ^ exp(6j; + ^Xdu;yd)j 



(x,i;')6Dl 



3.4 Conjugate Beta Prior 

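Recall that our conjugate prior crucially depends on two functions,
$\alpha(\theta)$ and $\beta(\theta)$, that "convert" the discriminative
parameters $\theta$ into a prior on the generative parameters,
$p(\tilde{\theta} \mid \theta)$. In the case of the binomial
likelihood, the conjugate prior is the Beta distribution. The
exponential form of the Beta prior is defined as:

$$p(\tilde{\theta}_{yd} \mid \theta_{yd}) = m(\theta_{yd}) \exp\big(\tilde{\theta}_{yd}\, \alpha(\theta_{yd}) - \beta(\theta_{yd}) A(\tilde{\theta}_{yd})\big)$$

where
$m(\theta_{yd}) = \frac{\Gamma(\beta(\theta_{yd}) + 2)}{\Gamma(\alpha(\theta_{yd}) + 1)\, \Gamma(\beta(\theta_{yd}) - \alpha(\theta_{yd}) + 1)}$
and $A(\tilde{\theta}_{yd}) = \log(1 + e^{\tilde{\theta}_{yd}})$.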

We select the functions $\alpha(\theta_{yd})$ and $\beta(\theta_{yd})$
such that: (1) the mode of the conjugate prior is $\theta_{yd}$ and
(2) the variance of the conjugate prior is controllable by the
hyperparameter $\gamma$. As can be seen in Figure 1, as $\gamma$ goes
to $\infty$, the variance goes to 0 and the prior forces the generative
parameters to be equal to the discriminative parameters (pure
generative model), and as $\gamma$ goes to 0, the variance goes to
$\infty$, which implies independence between the generative and
discriminative parameters (pure discriminative model). Other values of
$\gamma$ interpolate between these two extremes. Thus, we choose
$\alpha(\theta_{yd}) = \gamma / (1 + e^{-\theta_{yd}})$ and
$\beta(\theta_{yd}) = \gamma$. This gives the mode of
$p(\tilde{\theta}_{yd} \mid \theta_{yd})$ at $\theta_{yd}$ with a
variance that decreases in $\gamma$, as desired.
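The effect of this choice can be checked numerically. Viewed as a density over the mean parameter $v \in (0, 1)$, an unnormalized prior $v^{\alpha} (1 - v)^{\beta - \alpha}$ is a Beta distribution with shape parameters $\alpha + 1$ and $\beta - \alpha + 1$, so its mode and variance follow from the standard Beta formulas. The sketch below uses that reading (an assumption about the parameterization in which the prior is examined) with the transformed discriminative parameter fixed at 0.2 and the four values of $\gamma$ used in Figure 1.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def beta_prior_shape(theta, gamma):
    """Mode and variance of the Beta prior over the mean parameter.

    Uses alpha = gamma * sigmoid(theta) and beta = gamma, and the Beta
    shape parameters a = alpha + 1, b = beta - alpha + 1.
    """
    alpha = gamma * sigmoid(theta)
    beta = gamma
    a = alpha + 1.0
    b = beta - alpha + 1.0
    mode = (a - 1.0) / (a + b - 2.0)
    variance = a * b / ((a + b) ** 2 * (a + b + 1.0))
    return mode, variance


theta = math.log(0.2 / 0.8)  # canonical parameter whose mean parameter is 0.2
shapes = [beta_prior_shape(theta, g) for g in (0.1, 1.0, 10.0, 100.0)]
modes = [m for m, _ in shapes]
variances = [s for _, s in shapes]

# The mode stays at the discriminative mean parameter for every gamma ...
assert all(abs(m - 0.2) < 1e-9 for m in modes)
# ... while the variance shrinks as gamma grows.
assert all(earlier > later for earlier, later in zip(variances, variances[1:]))
```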

It is important to note that our choice of hyperparameters for the
conjugate prior is not specific to this example, but holds true in
general. In the general case, let $A$ be the log-partition function
associated with the generative model; then the conjugate prior
hyperparameters should be $\alpha(\theta) = \gamma A'(\theta)$ and
$\beta(\theta) = \gamma$. This gives us the mode of the conjugate prior
at $\theta$ with a variance that decreases in $\gamma$. In the
beta/binomial hybrid model, $A'(\theta) = A'(w) = 1 / (1 + e^{-w})$.
Also note that in the beta/binomial example, $A'(\theta)$ is also the
transformation function $T$ that transforms the discriminative mean
parameters $w$ to the generative mean parameters $v$.
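The general recipe can be sanity-checked by noting that the exponent of the conjugate prior, $\tilde{\theta}\,\alpha(\theta) - \beta(\theta) A(\tilde{\theta})$, has derivative $\alpha(\theta) - \beta(\theta) A'(\tilde{\theta})$ with respect to $\tilde{\theta}$, which vanishes at $\tilde{\theta} = \theta$ when $\alpha(\theta) = \gamma A'(\theta)$ and $\beta(\theta) = \gamma$. The sketch below verifies this for the Bernoulli case and, purely as an additional illustration that is not part of the experiments, for the Poisson distribution, whose log-partition function in natural form is $A(\theta) = e^{\theta}$. The function names are hypothetical.

```python
import math


def coupled_hyperparameters(theta, gamma, a_prime):
    """Return (alpha, beta) with alpha = gamma * A'(theta), beta = gamma."""
    return gamma * a_prime(theta), gamma


def prior_exponent_slope(theta_tilde, alpha, beta, a_prime):
    """Derivative of theta_tilde * alpha - beta * A(theta_tilde)."""
    return alpha - beta * a_prime(theta_tilde)


def bernoulli_a_prime(t):
    return 1.0 / (1.0 + math.exp(-t))


def poisson_a_prime(t):
    return math.exp(t)


for a_prime in (bernoulli_a_prime, poisson_a_prime):
    theta, gamma = 0.7, 5.0
    alpha, beta = coupled_hyperparameters(theta, gamma, a_prime)
    # The exponent is stationary at theta_tilde = theta ...
    assert abs(prior_exponent_slope(theta, alpha, beta, a_prime)) < 1e-12
    # ... rising below it and falling above it, so theta is the mode.
    assert prior_exponent_slope(theta - 0.1, alpha, beta, a_prime) > 0.0
    assert prior_exponent_slope(theta + 0.1, alpha, beta, a_prime) < 0.0
```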

In Figure 1, we also compare the Beta prior (solid blue curves) to an
"equivalent" logistic-Normal prior (dashed black curves) for four
settings of $\gamma$. The logistic-Normal is parameterized to have the
same mode and variance as the Beta prior. As we can see, for high
values of $\gamma$ (wherein the model is essentially generative), the
two behave quite similarly. However, for more moderate settings of
$\gamma$, the priors are qualitatively quite different.








Figure 1: Effect of gamma on the Beta prior (solid curve) and 
logistic-Normal prior (dashed curve) for gamma=0.1, 1, 10, 
100 (top-left, top-right, bottom-left, bottom-right) and for the 
transformed discriminative parameter $T(w) = 0.2$.



4 Related Work 

There have been a number of efforts to combine generative 
and discriminative models to obtain a hybrid model that per- 
forms better than either individually. Some of the earlier 
works [Raina et al., 2003; Bouchard and Triggs, 2004] use
completely different approaches to hybridize these models; 
Raina et al. [2003] present a model for the document classifi- 
cation task where a document is split into multiple regions 
and complementary properties of generative/discriminative 
models are exploited by training a large set of the parameters 
generatively and only a small set of parameters discrimina- 
tively. Bouchard and Triggs [2004] build a hybrid model by 
taking a linear combination of generative and discriminative 
model. This model is similar to the multi-conditional learn- 
ing model presented by McCallum et al. [2006]. Jaakkola and 
Haussler [1999] describe a scheme in which the kernel of a 
discriminative classifier is extracted from a generative model. 
Though these models have been shown to perform better than just
the discriminative or generative model, none of them combines
the two models in a natural way.

Our work builds on the work of Lasserre et al. [2006] 
and Druck et al. [2007], which are discussed in Section 2.2. 
Along these lines, Fujino et al. [2007] present another hy- 
brid approach where a generative model is trained using a 
small number of labeled examples. Since the generative 
model has high bias, a generative "bias-correction" model is 
trained in a discriminative manner to discriminatively com- 
bine the bias-correction model with the generative model. 
Most of these works focus on the application and little on
the theory of the hybrid model. There is also recent
work by Bouchard [2007] that presents a unified framework
for the "PCP" model and the "convex-combination" model 
[Bouchard and Triggs, 2004], and proves performance prop- 
erties. 



Dataset   No. of Features   Dataset description
-------   ---------------   -------------------
movie     24,841            classifies the sentiment of movie reviews from IMDB as positive or negative
webkb     22,824            classifies webpages from universities as student, course, faculty or project
sraa      77,494            classifies messages by the newsgroup to which they were posted: simulated-aviation, real-aviation, simulated-autoracing, real-autoracing



Table 1: Description of the datasets used in the experiments 



5 Experiments 

5.1 Experimental Setup 

In this section, we show empirical results of our approach and compare
them with the existing state-of-the-art semi-supervised method most
closely related to ours [Druck et al., 2007]. In order to have a fair
comparison, we use the experimental setup of Druck et al. [2007] and
perform experiments only on the datasets where the PCP model has been
shown to perform best. There are three such datasets: movie, sraa and
webkb. A description of these datasets is given in Table 1.

Although all of the examples in these datasets are labeled, we perform
experiments by taking a subset of each dataset as labeled and treating
the rest of the examples as unlabeled. We use either 10 or 25 labeled
examples from each class and vary the number of unlabeled examples from
0 to a maximum of 1000. The number of unlabeled examples is the same in
each class. We show our results for two sets of experiments: (1) we
show how performance varies as we vary the number of unlabeled
examples; (2) we show how performance varies with respect to
$\lambda$. Here $\lambda$ normalizes $\gamma \in [\infty, 0]$ to the
range $[0, 1]$ using $\gamma = ((1 - \lambda)/\lambda)^2$. Now
$\lambda = 0$ corresponds to the pure generative case while
$\lambda = 1$ corresponds to the pure discriminative case. As in the
work of Druck et al. [2007], the success of semi-supervised learning
depends on the quality of the labeled examples; therefore, we choose
five random labeled sets and report the average over them. In our
results, we report the percentage classification accuracy, which is
the ratio of the number of examples correctly classified to the total
number of test examples.
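The two quantities used throughout the experiments, the mapping from the normalized trade-off parameter to the prior strength and the percentage accuracy, can be sketched as follows. The exponent of 2 in the mapping is an assumption about the reparameterization, and the handling of the endpoint $\lambda = 0$ as an infinite prior strength is our convention; the function names are hypothetical.

```python
import math


def gamma_from_lambda(lam):
    """Map lambda in [0, 1] to gamma in [infinity, 0].

    lambda = 0 is the pure generative case (gamma = infinity) and
    lambda = 1 is the pure discriminative case (gamma = 0).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must lie in [0, 1]")
    if lam == 0.0:
        return math.inf
    return ((1.0 - lam) / lam) ** 2  # exponent 2 is an assumption


def accuracy_percent(predicted, gold):
    """Percentage of test examples whose predicted label matches the gold label."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return 100.0 * correct / len(gold)


assert gamma_from_lambda(0.5) == 1.0
assert gamma_from_lambda(1.0) == 0.0
assert accuracy_percent([0, 1, 1, 0], [0, 1, 0, 0]) == 75.0
```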

5.2 Results and Discussion 

Results on the above mentioned three datasets are presented 
in Table 2. The table shows the results for the PCP model with the
Gaussian prior (PCP-Gauss) and with the Beta prior (PCP-
Beta). Since PCP-Beta uses the binomial version of NB,
we reimplemented PCP-Gauss for the binomial version of
NB and compare the results with it. Though we also show
the results for PCP-Gauss multinomial [Druck et al., 2007],
a fair comparison would be to compare only binomial mod- 
els. %change is the change in PCP-Beta with respect to the 
PCP-Gauss binomial version. As we see, PCP-Beta performs 
better than PCP-Gauss binomial in all experiments and bet- 
ter than PCP-Gauss multinomial in all experiments except 





Dataset       PCP-Gauss Mult   PCP-Gauss Bin   PCP-Beta Bin   %change
-----------   --------------   -------------   ------------   -------
movie (10)    64.6             63.4 (3.2)      [illegible]    +7.7%
movie (25)    68.6             69.0 (1.5)      76.7 (1.2)     +11.1%
webkb (10)    72.5             73.7 (3.7)      75.3 (2.9)     +2.2%
webkb (25)    76.7             83.8 (1.3)      83.9 (1.6)     +1.1%
sraa (10)     81.6             67.7 (6.8)      79.1 (4.0)     +16.8%
sraa (25)     84.1             76.6 (3.5)      86.1 (1.0)     +12.4%

Table 2: Comparative results for PCP with Gaussian prior and
PCP with Beta prior. Parenthesized values denote the number
of labeled examples per class and the standard deviation.



sraa (10). Compared to PCP-Gauss binomial, PCP-Beta performs
significantly better on the sraa and movie datasets.
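The %change column is the relative change of PCP-Beta with respect to PCP-Gauss binomial. A minimal sketch of that computation, using the two sraa rows of Table 2 as inputs (the function name is ours):

```python
def percent_change(baseline, new):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# sraa (10): PCP-Gauss binomial 67.7, PCP-Beta 79.1
assert round(percent_change(67.7, 79.1), 1) == 16.8
# sraa (25): PCP-Gauss binomial 76.6, PCP-Beta 86.1
assert round(percent_change(76.6, 86.1), 1) == 12.4
```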

Comparing the multinomial and binomial versions of PCP-Gauss, we see
that for the movie and webkb datasets, the binomial version performs
better than (or almost equal to) the multinomial version, while for
the sraa dataset, the multinomial version performs better. We
conjecture that this is because sraa has a large number of features
and the feature independence assumption is less violated in
multinomial NB than in binomial NB. When datasets do not have too many
features, the binomial version tends to perform better because
binomial NB accounts for both the presence and the absence of
features, in contrast to multinomial NB, which only accounts for the
presence of features.

Figure 2 and Figure 3 show accuracy vs. $\lambda$ for different
numbers of unlabeled examples on the sraa and movie datasets,
respectively. Recall that $\lambda = 0$ is the purely generative model
and $\lambda = 1$ is the purely discriminative model. In both of these
figures, we see that as we increase the number of unlabeled examples,
performance improves. In sraa, we observe that increasing the number
of unlabeled examples shifts the optimal $\lambda$ ($\lambda^*$)
towards the right. We get an optimal $\lambda^* = 0.2$ for a fully
supervised model, while for 1000 unlabeled examples, we get
$\lambda^* = 0.5$. All the curves in this experiment are unimodal,
which means that there is a unique value of $\lambda$ where the hybrid
model performs best.

Unfortunately, these nearly perfectly shaped curves are not common to
all settings. We do not observe them in the other dataset (Figure 3).
There are values of $\lambda$ where a fully supervised model performs
better than the best semi-supervised model. This experiment
emphasizes the need for choosing the right value of $\lambda$ and also
shows the importance of the hybrid model. If we do not choose the
right value of $\lambda$, we might end up hurting the model by using
the unlabeled data. We also observe that the movie dataset gives us a
bimodal curve, in contrast to the unimodal curve obtained on sraa. We
see that the curve is unimodal in the supervised setting, but as we
introduce unlabeled examples, the curves not only become bimodal but
also shift towards the left-hand side (the best accuracy is achieved
close to the generative end). This naturally suggests that the
generative model is actually affecting the hybrid model in a positive
manner and exploiting the strength of the unlabeled examples.
Figure 2: Results for sraa dataset for different number of unlabeled
examples. Number of labeled examples=10.



Figure 3: Results for movie dataset for different number of 
unlabeled examples. Number of labeled examples=25. 



6 Conclusion and Future Work 

We have presented a generalized "PCP" hybrid model for exponential
family distributions and have experimentally shown that the prior
conjugate to the generative model is more appropriate than the
Gaussian prior. In addition to the performance advantage, the
conjugate prior also gives us a closed-form solution for the
generative parameters. In the future, we aim to interpret these
results theoretically and answer questions like: (1) Under what
conditions will the hybrid model perform better than both the
generative and discriminative models? (2) What is the optimal value of
$\gamma$? (3) Is a PAC-style analysis of the hybrid model possible for
the finite sample case, as opposed to the asymptotic analysis mostly
found in the literature?

References 

[Bouchard and Triggs, 2004] G. Bouchard and Bill Triggs. The
tradeoff between generative and discriminative classifiers. In
IASC International Symposium on Computational Statistics,
pages 721-728, Prague, August 2004.

[Bouchard, 2007] Guillaume Bouchard. Bias-variance tradeoff in 
hybrid generative-discriminative models. In Proceedings of the 
Sixth International Conference on Machine Learning and Ap- 
plications, pages 124-129, Washington, DC, USA, 2007. IEEE 
Computer Society. 

[Cozman et al., 2003] Fabio Gagliardi Cozman, Ira Cohen,
Marcelo Cesar Cirelo, and Escola Politcnica. Semi-supervised
learning of mixture models. In 20th International Conference on
Machine Learning, pages 99-106, 2003.

[Druck et al., 2007] Gregory Druck, Chris Pal, Andrew McCallum, 
and Xiaojin Zhu. Semi-supervised classification with hybrid gen- 
erative/discriminative methods. In Proceedings of the 13th ACM 
SIGKDD international conference on Knowledge discovery and 
data mining, pages 280-289, New York, NY, USA, 2007. ACM. 

[Fujino et al., 2007] Akinori Fujino, Naonori Ueda, and Kazumi
Saito. A hybrid generative/discriminative approach to text
classification with additional information. Inf. Process. Manage.,
43(2):379-392, 2007.



[Jaakkola and Haussler, 1999] Tommi S. Jaakkola and David Haussler.
Exploiting generative models in discriminative classifiers.
In Advances in Neural Information Processing Systems 11, pages
487-493. MIT Press, 1999.

[Lasserre et al., 2006] Julia A. Lasserre, Christopher M. Bishop,
and Thomas P. Minka. Principled hybrids of generative and
discriminative models. In Proceedings of the 2006 IEEE Computer
Society Conference on Computer Vision and Pattern Recognition,
pages 87-94, Washington, DC, USA, 2006. IEEE Computer Society.

[McCallum and Nigam, 1998] A. McCallum and K. Nigam. A comparison
of event models for naive Bayes text classification. In
AAAI Workshop on "Learning for Text Categorization", 1998.

[McCallum et al., 2006] Andrew McCallum, Chris Pal, Greg
Druck, and Xuerui Wang. Multi-conditional learning:
Generative/discriminative training for clustering and classification. In
Proceedings of the 21st National Conference on Artificial
Intelligence, pages 433-439, 2006.

[Ng and Jordan, 2002] Andrew Y. Ng and Michael I. Jordan. On 
discriminative vs. generative classifiers: A comparison of logistic 
regression and naive bayes. In Advances in Neural Information 
Processing Systems 14, Cambridge, MA, 2002. MIT Press. 

[Nigam et al., 2000] Kamal Nigam, Andrew K. McCallum, Sebastian
Thrun, and Tom Mitchell. Text classification from labeled
and unlabeled documents using EM. Machine Learning,
39(2):103-134, May 2000.

[Raina et al., 2003] Rajat Raina, Yirong Shen, Andrew Y. Ng,
and Andrew McCallum. Classification with hybrid
generative/discriminative models. In Advances in Neural Information
Processing Systems 16. MIT Press, 2003.

[Zhu, 2005] Xiaojin Zhu. Semi-supervised learning literature sur- 
vey. Technical Report 1530, Computer Sciences, University of 
Wisconsin-Madison, 2005. 



